Abstract

Tertiary structure prediction of a protein from its amino acid sequence is one of the major challenges in the field of
bioinformatics. The hierarchical approach is one of the persuasive techniques used for predicting protein tertiary structure,
especially in the absence of homologous protein structures. In the hierarchical approach, intermediate states such as
secondary structure, dihedral angles, Cα-Cα distance bounds, etc. are predicted. These intermediate states are used to restrain the protein
backbone and assist its correct folding. In recent years, several methods have been developed for predicting the dihedral
angles of a protein, but it is difficult to conclude which method is better than the others. In this study, we benchmarked the
performance of the dihedral prediction methods ANGLOR and SPINE X on various datasets, including independent datasets.
The TANGLE dihedral prediction method was not benchmarked (because its standalone version was unavailable) and was compared with
SPINE X and ANGLOR only on the ANGLOR dataset, on which TANGLE has reported its results. It was observed that SPINE X
performed better than ANGLOR and TANGLE, especially in the prediction of dihedral angles of glycine and proline
residues. The analysis suggested that angle shifting was the foremost reason for the better performance of SPINE X. We further
evaluated the performance of the methods on the independent ccPDB30 dataset and observed that SPINE X performed better
than ANGLOR.
Introduction

One of the ultimate goals of bioinformatics is the prediction of
protein tertiary structure from its primary sequence. In the past,
several techniques were developed for predicting the tertiary structure
of a protein, including homology- and threading-based
approaches [1,2,3,4,5]. The performance of these methods
depends on the homology between query and target sequences.
Therefore, these techniques work best when homologous templates
are available and are not designed to work in the absence of
homologous protein sequences/structures. The hierarchical approach
provides an alternative for predicting the structure of a protein when it is
difficult to detect homologous protein sequences in the Protein
Data Bank (PDB). In this approach, intermediate states such as
secondary structure states [6,7,8], super-secondary structures
[9,10,11], turns [12,13,14,15,16,17], Cα-Cα distance bounds,
backbone dihedral angles of proteins, etc. are used as restraints to
assist the correct folding of the protein backbone [18,19,20]. Recently,
Kurgan et al. reviewed the progress in the field of intermediate 
state or one-dimension prediction [21]. It was observed that 
predicted secondary structure is useful in the prediction of 
disorder, flexible region, fold recognition and function prediction. 
It was also observed that dihedral angle (or backbone torsion 
angle) and secondary structures of a protein are highly correlated. 
In the Ramachandran plot, phi-psi angles generally cluster around
phi = -60°, psi = -40° for helix, phi = -120°, psi = 120° for beta-strand,
and around phi = 60°, psi = 40° for L-helix [22]. The dihedral
angle omega is almost fixed at 180° or 0° due to the planarity of
the peptide bond, which has partial double-bond character [23]. Apart from helix and sheet, which
have defined phi-psi regions, coil residues are distributed over most of
the Ramachandran plot. Strong correlations exist between the
dihedral state of a residue and its immediate sequence neighbors
[24]. This correlation helps in accurately defining the local
ordering/conformation in proteins. Secondary
structure predictions do not distinguish one loop conformation from
another, whereas backbone dihedral angles accurately provide the local
structural information that is useful in defining highly variable loop
regions in a primary sequence. Backbone torsion angles significantly
reduce the conformational search space for tertiary
structure prediction. Thus, prediction of dihedral angles is
especially useful for predicting the tertiary structure of proteins.
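
As a rough illustration of how the cluster centres quoted above partition the Ramachandran plot, the following Python sketch assigns a (phi, psi) pair to the nearest of the three centres using a wrap-around angular distance. The names `CENTRES`, `circular_distance` and `nearest_region`, and the nearest-centre rule itself, are assumptions made here for illustration; they are not taken from any of the benchmarked methods.

```python
# Approximate cluster centres (phi, psi) in degrees, as quoted in the text.
CENTRES = {
    "helix": (-60.0, -40.0),
    "beta-strand": (-120.0, 120.0),
    "L-helix": (60.0, 40.0),
}


def circular_distance(a, b):
    """Smallest absolute difference between two angles, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def nearest_region(phi, psi):
    """Return the label of the cluster centre closest to (phi, psi).

    Illustrative rule only: the sum of circular distances in phi and psi.
    """
    return min(
        CENTRES,
        key=lambda name: circular_distance(phi, CENTRES[name][0])
        + circular_distance(psi, CENTRES[name][1]),
    )
```

A call such as `nearest_region(-65.0, -45.0)` falls in the helix cluster. Coil residues, which are spread over most of the plot, are not captured by such a coarse rule, which is one reason real-valued angle prediction is of interest.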

Dihedral angle prediction has many applications in protein
structure prediction, including: (i) supplementing
secondary structure prediction [25,26,27], (ii) generation of
multiple sequence alignments [28,29], (iii) identification of protein
folds [30,31,32] and (iv) fragment-free tertiary structure prediction
[19]. Initially, dihedral prediction methods were developed for
predicting a few discrete states based on their distribution in
the Ramachandran plot [33,34,35,36,37,38]. Wood et al. first
developed a method for predicting real values of the dihedral
angle psi and used this information to predict protein
secondary structure with high accuracy [26]. Later, Real-SPINE
(1.0, 2.0 and 3.0), ANGLOR and TANGLE were developed to
predict the real values of both phi and psi dihedral angles
[39,40,41,42,43]. Real-SPINE was developed on a dataset of 2640
proteins with a mean absolute error (MAE) of 54° for the psi angle. The prediction was further
improved in the successive methods Real-SPINE 2.0 (38°/25° for
psi/phi angles, respectively), Real-SPINE 3.0 (36°/22°), SPINE X
(35° for psi) and SPINE XI (33.4° for psi) [44]. The new version of
SPINE X incorporates the SPINE XI algorithm and has an MAE of
33.4°, equivalent to SPINE XI. In our study we used the new
version of SPINE X. ANGLOR and TANGLE were developed on
a dataset of 1989 proteins and achieved MAEs of 46°/28°
(ANGLOR) and 44.6°/27.8° (TANGLE).

Presently, it is difficult to conclude which method among
SPINE X, ANGLOR and TANGLE performs better than the others,
as these methods have been tested on different datasets. In this
study, we benchmarked the principal prediction
methods SPINE X and ANGLOR. These methods were evaluated
on three different datasets: (i) SPINE X (2479 protein chains), (ii)
ANGLOR (1989 protein chains), and (iii) a new dataset from
ccPDB (4682 protein chains) [40,42,45]. As the standalone version of
the TANGLE method was not available, we were unable to
benchmark TANGLE on all datasets. Instead, we
compared it with SPINE X and ANGLOR only on
the ANGLOR dataset, on which TANGLE has reported its results.
We also analyzed why different algorithms perform
differently for a few amino acids with respect to their secondary
structure. We have also provided the raw data (prediction results
of the methods on different datasets) in an easily understandable text
format, which can be downloaded from
http://crdd.osdd.net/raghava/download/rawdata.tgz.

Materials and Methods 

Datasets Used for Evaluation 

In this study, we evaluated the performance of different
methods on datasets used in previous studies. In addition, we
created a new dataset from the PDB using the ccPDB server.

The datasets are described below.

SPINE X dataset. This dataset contains 2479 protein chains
that were obtained from the SPINE X server
(http://sparks.informatics.iupui.edu/SPINE-X/list.spinex.tgz) [40].

ANGLOR dataset. We obtained this dataset from the ANGLOR
web site at http://zhanglab.ccmb.med.umich.edu/ANGLOR/benchmark.html.
Out of the total of 1989 chains, 500 chains
were used as training data, 460 as validation data and 1029 as
testing data [42].

ccPDB Dataset. We created a new dataset using the database
cum web server ccPDB, "compilation and creation of datasets from
PDB" (http://crdd.osdd.net/raghava/ccpdb) [45]. We extracted
those protein chains from ccPDB that satisfy the following three
criteria: (i) resolution better than 2 Å, (ii)
Rfree less than 0.25 and (iii) number of residues in each chain
between 50 and 3000. We created a non-redundant dataset of 4682 protein chains
with a sequence identity cut-off of 30%. This
dataset was named according to its sequence identity level, i.e.
ccPDB30, and consists of chains having sequence
identity less than 30%. The list of PDB IDs used in the ccPDB30
dataset is provided in Table S1 of File S2. For more information
on PDB chain sequence identity levels, please refer to
ftp://resources.rcsb.org/sequence/clusters. We obtained the dihedral
angles of all PDB chains using the DSSP software [46].
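
The three selection criteria above can be expressed as a simple filter. The Python sketch below is only an illustration: the `ChainRecord` type, its field names and the inclusive treatment of the 50 to 3000 residue bounds are assumptions, and the actual selection in this study was performed through the ccPDB server. The redundancy reduction at 30% sequence identity is not shown.

```python
from dataclasses import dataclass


@dataclass
class ChainRecord:
    """Hypothetical per-chain metadata; field names are illustrative only."""
    chain_id: str
    resolution: float  # in Angstrom
    r_free: float
    length: int        # number of residues in the chain


def passes_quality_filters(chain):
    """Apply the three criteria: resolution, Rfree and chain length."""
    return (
        chain.resolution < 2.0          # (i) better than 2 Angstrom
        and chain.r_free < 0.25         # (ii) Rfree below 0.25
        and 50 <= chain.length <= 3000  # (iii) bounds assumed inclusive
    )


def select_chains(chains):
    """Keep only the chains that satisfy all three criteria."""
    return [c for c in chains if passes_quality_filters(c)]
```

For example, a 120-residue chain solved at 1.8 Å with Rfree 0.22 would be kept, whereas a chain solved at 2.5 Å would be discarded.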



Dihedral Angle Prediction Methods

SPINE X. The method utilizes a guided-learning artificial
neural network for prediction of dihedral angles. In the first step,
sequence profile, seven representative physical parameters and
secondary structure were used as input to predict the normalized
solvent accessibility value of a residue. The normalized solvent
accessibility value was combined with the above stated input
features to predict the real-value dihedral angles. This method is
then combined with a discrete state classifier to improve the
accuracy of the predicted angles. The resulting predicted angles were
further refined with a conditional random field model to give the
final predicted angles. The method is available at
http://sparks.informatics.iupui.edu/SPINE-X/index.html.

ANGLOR. The method is a composite machine-learning
algorithm using a neural network for phi angle prediction and a
Support Vector Machine (SVM) for psi angle prediction. In the
first step, the sequence profile is used to predict the secondary structure
and solvent accessibility value of a residue. In the next step, three
features (sequence profile, secondary structure and solvent
accessibility) are used as the input vector to predict dihedral angles.
The method is available at http://zhanglab.ccmb.med.umich.edu/ANGLOR/.

TANGLE. This method is based on two-level prediction using an
SVM-based regression approach. In the first level, features derived
from the sequence (PSSM profiles, secondary structure, solvent
accessibility, native disorder, sequence length and sequence
weight) are used as input to predict initial dihedral angles. The
predicted dihedral angles from the first level are used as input in the
second level to predict the final refined dihedral angles. TANGLE
is available at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/webserver.html.

Performance Evaluation. We used the Mean Absolute Error
(MAE) as described by Wu et al. [42] for assessing the prediction
of phi/psi angles throughout the study. According to Wu et al., the
MAE is defined as the average absolute difference in degrees between the
predicted (P) and the experimental (E) values over all residues. MAE
measures the accuracy for continuous variables, e.g. dihedral
angles, and is the standard practice for evaluating dihedral angle
prediction methods [39,40,41,42,43]. MAE is defined by the
following formula:

\[
\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - x_i \right| \tag{1}
\]

where $x_i$ and $y_i$ are the actual (observed) and predicted dihedral
angles of the $i$-th residue and $N$ is the total number of residues.

To test whether the obtained MAE difference between
the methods is statistically significant, we applied the Wilcoxon signed
rank test using the coin package [47] in the R statistical programming
language [48] to calculate the p-value for the comparison. We also
reported the Root Mean Square Error (RMSE) and Pearson
correlation coefficient (PCC) achieved by all the methods on all
the datasets. However, it should be kept in mind that in assessing
the quality of prediction of dihedral angles, PCC appears to be a
less robust measure [40,41,42]. RMSE and PCC are defined by
the following formulae:

\[
\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - x_i \right)^2} \tag{2}
\]

\[
\mathrm{PCC} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \times \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}} \tag{3}
\]
Table 3. Performance of the random prediction method, in terms of MAE (degrees), on the ANGLOR, SPINE X and ccPDB30 datasets for the
prediction of phi and psi dihedral angles.

                    Random phi prediction           Random psi prediction
Residue/Dataset     ANGLOR   SPINE X   ccPDB30      ANGLOR   SPINE X   ccPDB30
ALA                   40.4      34.3      33.7        83.6      82.3      83.0
CYS                   44.6      42.7      42.2        88.5      88.7      88.3
ASP                   47.8      42.2      41.5        84.8      83.2      83.6
GLU                   40.3      33.2      33.3        83.6      78.9      80.8
PHE                   43.9      40.6      40.5        88.0      89.5      89.4
GLY                   88.5      87.8      88.2        87.3      88.1      88.2
HIS                   49.2      45.7      46.7        89.6      88.7      87.1
ILE                   34.9      32.9      32.5        88.1      88.5      88.1
LYS                   44.0      38.1      38.4        85.9      84.4      85.6
LEU                   34.3      30.5      30.2        87.9      85.4      86.2
MET                   46.8      40.7      39.2        88.5      86.7      87.7
ASN                   59.5      55.6      56.4        83.8      81.8      81.0
PRO                   14.0      13.2      12.4        87.7      87.7      87.4
GLN                   42.2      37.3      37.7        84.8      81.2      84.3
ARG                   42.9      39.2      39.4        86.0      85.2      86.3
SER                   49.7      42.8      42.1        89.7      89.9      89.7
THR                   41.4      36.8      35.7        89.0      89.8      88.6
VAL                   37.7      34.5      34.0        86.7      86.9      86.0
TRP                   40.4      38.2      38.8        90.1      88.7      89.0
TYR                   42.2      41.4      40.6        89.6      89.3      89.2
ALL                   44.7      40.4      40.2        86.8      85.8      86.1
Helix (H)             36.3      32.0      32.0        82.7      78.4      80.5
Sheet (E)             44.9      44.2      43.2        90.7      93.7      92.2
Coil (C)              51.1      46.4      45.9        87.9      88.2      87.5

doi:10.1371/journal.pone.0105667.t003



In Equations (2) and (3), $x_i$ and $y_i$ are the actual (observed) and predicted dihedral
angles of the $i$-th residue; $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$,
and $N$ is the total number of residues.

As the data are circular in nature, we calculated the difference
between the actual and predicted/mean dihedral angles as per Wu et
al. [42] for calculating both RMSE and PCC.
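
The following Python sketch shows one way the three measures could be computed for circular data. It assumes the common wrap-around convention in which an angular difference is mapped into the range -180° to 180°, so that, for example, 170° and -170° differ by 20° rather than 340°. The function names, and the use of an arithmetic mean inside `pcc`, are assumptions made for illustration and may differ in detail from the procedure of Wu et al. [42].

```python
import math


def wrapped_difference(a, b):
    """Signed difference a - b in degrees, mapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def mae(observed, predicted):
    """Mean absolute error over wrapped differences (Equation 1)."""
    diffs = [abs(wrapped_difference(y, x)) for x, y in zip(observed, predicted)]
    return sum(diffs) / len(diffs)


def rmse(observed, predicted):
    """Root mean square error over wrapped differences (Equation 2)."""
    squares = [wrapped_difference(y, x) ** 2 for x, y in zip(observed, predicted)]
    return math.sqrt(sum(squares) / len(squares))


def pcc(observed, predicted):
    """Pearson correlation (Equation 3) using wrapped deviations from the mean.

    Assumption: the means are plain arithmetic means of the angle values.
    """
    mean_x = sum(observed) / len(observed)
    mean_y = sum(predicted) / len(predicted)
    dx = [wrapped_difference(x, mean_x) for x in observed]
    dy = [wrapped_difference(y, mean_y) for y in predicted]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return numerator / denominator
```

With observed angles `[170.0, -170.0, 0.0]` and predictions `[-170.0, 170.0, 10.0]`, the wrapped errors are 20°, 20° and 10°, whereas a naive subtraction would report 340° for the first two residues.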

Results 

Evaluation of Existing Methods

We evaluated the performance of existing methods on the different
datasets used in the past for developing prediction methods. In
addition, the performance of existing methods was also evaluated
on the new, independent dataset generated in this study. We also
performed amino acid-specific random prediction as
described by Wu et al. [42] and Song et al. [39] to provide a
baseline comparison of the methods with a random method. Wu
et al. took the dihedral angles randomly from an amino acid-specific
pool obtained using a training dataset of 500 proteins and repeated
this random process 10,000 times to get a stable distribution. We
adopted the same process for random prediction. On the SPINE
X and ccPDB30 datasets, the whole respective dataset was used for
amino acid-specific pool generation to obtain the random prediction.
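
The random baseline described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the input is taken to be a list of (residue name, angle) pairs, the error uses a wrap-around angular difference, and the function names `build_pools` and `random_baseline_mae` are hypothetical. It is not the code used in this study or by Wu et al. [42].

```python
import random
from collections import defaultdict


def build_pools(training):
    """Group observed angles by residue type: {residue: [angles]}."""
    pools = defaultdict(list)
    for residue, angle in training:
        pools[residue].append(angle)
    return pools


def random_baseline_mae(test, pools, repeats=10000, seed=0):
    """Average MAE of predictions drawn at random from each residue's pool.

    Each repeat draws one angle per test residue from the pool of the same
    amino acid and scores it with a wrap-around absolute difference.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(repeats):
        error = 0.0
        for residue, actual in test:
            guess = rng.choice(pools[residue])
            d = abs(guess - actual) % 360.0
            error += min(d, 360.0 - d)
        total += error / len(test)
    return total / repeats
```

Drawing from residue-specific pools means the baseline already reflects the typical angle distribution of each amino acid, so a useful predictor has to exploit sequence context to do better.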
The performance of these methods on various datasets is described 
below: 



ANGLOR dataset. First, we evaluated the performance of the
methods on the ANGLOR dataset. As shown in Table 1, for the dihedral
angle phi, ANGLOR, TANGLE and SPINE X achieved MAEs of
28.20°, 27.80° and 24.83°, respectively, between actual and
predicted phi. These results show that SPINE X performs better
than the other methods. SPINE X achieved MAEs of 56.70° and 9.63°
for glycine and proline, which is much better than ANGLOR
(75.1° and 15.2°) and TANGLE (84.1° and 13.6°). Both SPINE X
and TANGLE performed better than ANGLOR for serine
and threonine residues. TANGLE performed relatively better than
the other two methods for helix-forming residues. The results show
that SPINE X performs best among all methods, but the
difference between the three methods is less than 4° (Table 1). In
the prediction of the psi angle, SPINE X performed better for
almost all residues, especially for glycine and proline residues.
ANGLOR, TANGLE and SPINE X have MAEs of 46.40°, 44.64°
and 38.80°, respectively (Table 2). Again, TANGLE performed
better than the other methods for helix-forming residues. The above
results clearly indicate that SPINE X outperforms the other two
methods by a margin of around 6°. The MAE of SPINE X for phi
and psi angles on this dataset is significantly smaller than that of
ANGLOR, with p-values of <<0.001 and <<0.001, respectively,
using the Wilcoxon signed rank test. With respect to random
prediction (Table 3), both ANGLOR and SPINE X performed
significantly better, with MAE differences of 16.5° (p-value<<0.001)
and 19.9° (p-value<<0.001) for phi and 40.8° (p-value<<0.001)
and 48.0° (p-value<<0.001) for psi, respectively. For both phi and
psi dihedral angles, SPINE X has a higher PCC than ANGLOR,
TANGLE (as reported) and random prediction. SPINE X has the
least RMSE in predicting the phi dihedral angle (Table S2, S3 in File S2).

Figure 1. Normal psi angle distribution of glycine.
doi:10.1371/journal.pone.0105667.g001

Figure 2. Psi angle distribution of glycine after shifting the angles.
doi:10.1371/journal.pone.0105667.g002

Figure 3. Normal psi angle distribution of Alanine.

SPINE X dataset. Next, we evaluated the performance of the
methods on the SPINE X dataset. SPINE X achieved an MAE of 20.8°
and performed better than ANGLOR, with an MAE of 24.31°, for the phi
angle. The differences were more pronounced for glycine, proline,
serine and threonine residues (Table 2). The same trend holds in
the case of the psi angle; SPINE X performed better for glycine, proline,
serine and threonine, having MAEs of 46.8°, 38.17°, 39.49° and
37.18° as compared to ANGLOR with MAEs of 65.17°, 58.59°,
52.6° and 49.46°, respectively. Overall, ANGLOR achieved an MAE
of 43.52° and SPINE X achieved 33.5° (Table 2). It is evident
from the results that SPINE X performs better than ANGLOR,
especially in the case of the psi angle. The differences in MAE between
SPINE X and ANGLOR for phi (3.5°) and psi (10°) angles on this
dataset correspond to a p-value of <<0.001 using the Wilcoxon
signed rank test. Both ANGLOR and SPINE X performed
significantly better than the amino acid-specific random prediction
method (Table 3), with MAE differences of 16.1° (p-value<<0.001)
and 19.6° (p-value<<0.001) for phi and 42.3° (p-value<<0.001)
and 52.3° (p-value<<0.001) for psi, respectively. SPINE X has the
highest PCC as compared to ANGLOR and random prediction
for phi and psi dihedral angles (Table S2, S3 in File S2).

ccPDB30 Dataset. We also evaluated the performance of
SPINE X and ANGLOR on the independent ccPDB30 dataset. For
the dihedral angle phi, SPINE X achieved an MAE of 21.23° and
ANGLOR achieved 24.46°. SPINE X performed much better for
glycine and proline, with MAEs 19.33° and 5.8° lower
than ANGLOR, respectively. Similarly, in the case of the psi angle, SPINE X achieved
MAEs 17.29° and 18.45° lower than ANGLOR for
glycine and proline residues, respectively. Overall, SPINE X, with an MAE
of 35.70°, performed much better than ANGLOR, with an MAE of
44.48°. The results clearly demonstrate the superior performance
of SPINE X over ANGLOR (Table 1, 2). Using the Wilcoxon signed
rank test, the MAE difference between SPINE X and ANGLOR
for the phi angle (3.3°) corresponds to a p-value<<0.001 and for the psi
angle (8.8°) to a p-value<<0.001. Both SPINE X and ANGLOR
performed significantly better than random prediction (Table 3),
with p-values (phi<<0.001; psi<<0.001) and (phi<<0.001;
psi<<0.001), respectively. SPINE X has the least RMSE and highest
PCC for the phi dihedral angle on this dataset (Table S2, S3 in File
S2).

Effect of Angle Shifting in SPINE X 

The results suggest that SPINE X performs better than
ANGLOR and TANGLE for the prediction of the psi angle. Amino
acid-wise comparison reveals that SPINE X performs better than
ANGLOR and TANGLE especially for glycine, proline, serine and
threonine. Interestingly, both glycine and proline do
not follow the standard Ramachandran plot. For glycine in the
ccPDB30 dataset, ANGLOR achieved an MAE of 65.32° and SPINE
X 48.03° for the psi angle. It has been observed that the distribution of
the psi angle for glycine in the helix region ranges from -55° to
-10°, sheet ranges from -180° to -130° and from 110° to 180°, and coil
occurs mainly in -180° to -120°, -45° to 45° and 130° to 180°,
as shown in Figure 1. SPINE X shifts the angles by adding 100°
to the angles between -100° and 180° and adding 460° to the
angles between -180° and -100°, thus shifting the angles from the range
-180° to 180° to the range 0° to 360° (Figure 2). The SPINE X authors have
suggested that this shifting ensures that a minimum number of
angles occur at the ends of the sigmoidal function, making the data
more linear and continuous, which ultimately improves the
learning by machine learning algorithms. To test whether shifting
the angles actually works, we developed two models on the SPINE X
dataset, one without angle shifting and the other with angle shifting.
It was observed that the model developed with shifted
angles has a 10° lower MAE in the case of glycine (data not shown).
We also observed that shifting the phi dihedral angle improved the
MAE in the case of glycine.
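
The shifting rule described above amounts to a rotation of the angular axis by 100°. A minimal Python sketch is given below; the function names `shift_angle` and `unshift_angle` are chosen here for illustration and are not taken from the SPINE X code.

```python
def shift_angle(angle):
    """Map an angle in [-180, 180] degrees onto the shifted 0..360 scale.

    Angles from -100 to 180 are moved up by 100; angles from -180 up to
    (but not including) -100 are moved up by 460, so they land in 280..360.
    """
    if angle >= -100.0:
        return angle + 100.0
    return angle + 460.0


def unshift_angle(shifted):
    """Inverse of shift_angle: map a shifted value back to [-180, 180]."""
    if shifted <= 280.0:
        return shifted - 100.0
    return shifted - 460.0
```

After the shift, the sparsely populated region around -100° sits at the two ends of the scale (0° and 360°), while the glycine populations near -180° and 180° become neighbours around 280°. A predicted value on the shifted scale is mapped back with `unshift_angle` before errors are computed.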

There are amino acids for which angle shifting does not improve
the performance because they have few residues in the -100°
to -180° range. Thus shifting of angles makes no difference, as in
the case of alanine (Figure 3). For graphs showing the dihedral
angle distributions of all 20 amino acids, please refer to File S1;
complete details are given in Tables S4 and S5 in File S2. We
also observed in the models developed on the SPINE X dataset that
angle shifting produces a negligible difference for alanine (data not
shown).

Discussion 

One of the advantages of predicting the dihedral angles of
residues over secondary structure states is that they can be
effectively used as restraints for building the tertiary structure of
proteins. In the past, methods were developed to predict the real values
of dihedral angles of residues in a protein. The assessment of the
performance of a method/technique plays a vital role in the
development of any field of science. It is important for users as well
as developers, since it allows users to find the best method for their
work and developers to compare their method with existing
methods. In this study, an attempt has been made to assess the
performance of existing methods in the field of dihedral prediction.
We benchmarked the performance of SPINE X and ANGLOR in
this study. The performance of these methods was evaluated on
datasets used in the past as well as on a new, independent
dataset generated using the ccPDB server. The TANGLE
method was compared with ANGLOR and SPINE X only on the
ANGLOR dataset because of its reported results on this dataset 